RmsdXNA: RMSD prediction of nucleic acid-ligand docking poses using machine-learning method

Abstract Small molecule drugs can be used to target nucleic acids (NA) to regulate biological processes. Computational modeling methods, such as molecular docking or scoring functions, are commonly employed to facilitate drug design. However, the accuracy of the scoring function in predicting the closest-to-native docking pose is often suboptimal. To overcome this problem, a machine learning model, RmsdXNA, was developed to predict the root-mean-square-deviation (RMSD) of ligand docking poses in NA complexes. The versatility of RmsdXNA has been demonstrated by its successful application to various complexes involving different types of NA receptors and ligands, including metal complexes and short peptides. The predicted RMSD by RmsdXNA was strongly correlated with the actual RMSD of the docked poses. RmsdXNA also outperformed the rDock scoring function in ranking and identifying closest-to-native docking poses across different structural groups and on the testing dataset. Using experimental validated results conducted on polyadenylated nuclear element for nuclear expression triplex, RmsdXNA demonstrated better screening power for the RNA-small molecule complex compared to rDock. Molecular dynamics simulations were subsequently employed to validate the binding of top-scoring ligand candidates selected by RmsdXNA and rDock on MALAT1. The results showed that RmsdXNA has a higher success rate in identifying promising ligands that can bind well to the receptor. The development of an accurate docking score for a NA–ligand complex can aid in drug discovery and development advancements. The code to use RmsdXNA is available at the GitHub repository https://github.com/laiheng001/RmsdXNA.


INTRODUCTION
Therapeutic treatments have typically focused on modifying protein targets [1,2], but there is a growing acknowledgment of the critical role of ribonucleic acid (RNA) and deoxyribonucleic acid (DNA) molecules in numerous biological processes and disease pathways [3][4][5][6].Dysfunctions in RNA and DNA that can cause diseases [7][8][9], such as cancer, neurological problems and viral infections [10,11], can be treated using small chemicals.As our understanding of NA increases, especially with developments in computational methodologies [12,13] and structure-solving technologies [14], targeting NA for drug development shows promise in the field of molecular medicine.
Currently, however, solving the structure of NA complexes using experimental methods remains a challenge.As an alternative, computational methods, including molecular docking, can enhance the understanding of NA complexes and expedite the development of robust drug discovery models [15][16][17].Recent development of such docking tools for NA-ligand complexes includes rDock [18], NLDock [19] and RLDOCK [20].Molecular docking involves predicting the binding orientation and affinity between a small molecule (ligand) and a target protein or NA (receptor).Its precision depends on the reliability of the scoring functions [21,22], which is crucial for saving time and cost for drug discovery [23,24].However, accurate modelling of NA structures remains challenging because of its complexity and f lexibility [25].While the MM-PBSA [26] and free-energy perturbation methods [27] can accurately select stronger ligand binders compared to docking scoring functions, they are time-consuming and not suitable for high-throughput docking or virtual screening (VS) applications.
Integrating machine learning (ML) into docking algorithms can improve the accuracy and efficiency of computational tools [28][29][30].Numerous ML algorithms that outperform a broad range of classical scoring functions have been successfully utilized in protein systems [31].However, these scoring functions often cannot be applied directly to NA complexes [32,33].The limited availability of high-quality training datasets for NA complex structures poses a significant obstacle to the generation of accurate and robust models [34].Hence, the exploration of alternative feature extraction and ML algorithms methods is essential for addressing these challenges.
Although not as common as protein-ligand systems, two recent ML-based scoring functions have been developed to improve scoring functions for RNA-ligand interactions.RNAPosers [35] is trained on 80 RNA-ligand complexes to classify if the ligand pose is native-like.On the other hand, AnnapuRNA [36] is trained on 131 RNA-ligand complexes to develop a scoring function for predicting binding poses.Both methods demonstrated improved performance compared to existing tools for pose identification.However, the small dataset used for training the models can lead to overfitting and hinder the model's performance for unseen docking scenarios.In addition, both methods use interaction features involving unmodified RNA residues only, neglecting crucial interactions involving modified atoms, which can alter the chemical properties of NA [37].A diversified model should also be applicable to some important NA-ligand complexes, such as metal complex [38] or short peptide ligands [39], and RNA-DNA hybrid receptors [40].Developing a ML-based scoring function suitable for NAs (both DNA and RNA) and structures with modified residues seems feasible due to their structural similarity.
In this work, a new ML-based regression model, RmsdXNA, was developed for accurately predicting the RMSD of the ligand poses for NA-ligand complex, which can also serve as the docking score.Using a larger dataset of 980 PDB structures than previous studies aims to improve generalizability.As the number of NA structures is still relatively small compared to proteins, using similar distance-based fingerprints from Wang et al. [41] with small feature dimension is feasible for RmsdXNA.Further modifications to the atom representation and the machine learning algorithm were performed to better suit NA-ligand datasets.RmsdXNA is applicable to diverse complexes involving different types of NA receptors (RNA, DNA, RNA-DNA hybrid, and NA with modified residues) and ligands, including metal complexes and short peptides.The best performing RmsdXNA model achieved an average Pearson correlation coefficient (PCC) of 0.64531 ± 0.36262 and Spearman's rank correlation coefficient (SRCC) of 0.58465 ± 0.35195 for each ligand.RmsdXNA outperforms the rDock [18] scoring function in its ability to rank and identify closest-tonative poses across different structural groups and on the testing dataset.This was also validated using the in-vitro experimental results by Swain et al. [42] on PAN ENE triple helix RNA.For MALAT1 virtual screening, MD simulations revealed a higher success rate for RmsdXNA in identifying the most promising ligands.These findings solidify RmsdXNA as a robust and effective scoring tool for pose identification and ligand screening in drug discovery applications.

Dataset preparation
The dataset used in the experiments consists of experimentally solved NA structures obtained from the Nucleic Acid Database (NDB) [43,44], which contains 11 821 NA structures including both DNA and RNA.PyMol [45] was used to process the structural information.Additional data cleaning and refinement procedures are described in Appendix B. In total, 1434 ligands from 980 NA complexes, with some structures having multiple ligands, were used in the experiment.The use of a large and diverse database allows the model to learn a comprehensive representation of the underlying patterns and relationships in the data, leading to better generalization to unseen examples.The list of PDB structures used is found in Appendix A.

Docking tools
Each ligand forms a receptor-ligand pair in each complex.Rigid NA-f lexible ligand docking was performed using rDocK [18] at the target search box, with the native ligand as the reference ligand.rDock was chosen because it outperforms other docking tools in pose identification for NA complexes [19,46].The docking pose search algorithm of rDock is similar to that of AutoDock [47], which uses the Monte Carlo simulated annealing method to generate random poses with unique configurations in each dock attempt.This data augmentation method can mitigate the limitation of the small NA-ligand dataset by increasing the number of datapoints with generated poses.Using rDock, the rbcavity command was used to generate the docking volume, with the docking radius set to 10 Å.For each redocking run, 100 runs-per-ligand rDock jobs were performed using the rbdock command.Only the docking poses with symmetry-corrected RMSD values ≤ 10 Å, measured using sPyRMSD [48], were considered as a successful dock and were included in the training dataset.100 poses were selected from each ligand that had at least 25% successful docking rate (able to generate at least 100 successful dock poses out of 400 dock attempts) to be included into the dataset.

Feature generation
The workf low of dataset generation and model training for Rms-dXNA to predict the RMSD of the docked poses is shown in Figure 1.For each distinctive interaction between the receptor and docked ligand atoms, the interatomic distance and atom types are used as features.In the context of ML scoring functions for NAs, the quantity of structures within the dataset is notably lower than that of the proteins.Reducing the number of employed features is crucial to prevent overfitting, the curse of dimensionality and data sparsity.Previous studies such as OnionNet [49] and OnionNet2 [50] adopted the count of contact points at each shell as their feature metric, whereas RNAposers [35] employed the receptor's atom name to generate distance-based features.However, these methodologies create an abundance of features, resulting in an unfavorable high feature-to-instance ratio dataset.On the other hand, the coarse-grain representation of the RNA and ligand structures [51,52] in AnnapuRNA [36] might sacrifice atomiclevel details, especially for non-canonical residues, potentially compromising the model's ability to capture specific interactions.
In RmsdXNA, the receptor and ligand SYBYL atom type representation is used as the interaction label.This generates fewer features in the dataset and allows the retention of important atomic-level interaction information such as pistacking and hydrogen bonding interactions.The remaining challenge involves combining each pairwise distance into a structured dataset suitable for training the ML model, while simultaneously retaining interaction proximity and frequency information.The distance is used to model the intermolecular potential energy between the receptor and the ligand molecule, which is represented by the sum of the coulomb and Lennard-Jones interaction terms, as shown in Equation 1.Such physicsinspired distance features were explored by Wang et al. [41] in protein-ligand interactions.
The values of the interaction constants are unknown and are factorized to form constants a, b and c.Equation 1 can be simplified to form Equation 2 by factorizing the constants.For a given receptor-ligand structure, the intermolecular distance between 2 atoms, r ij , can be easily obtained from the structural coordinates.ML is then used to determine the arbitrary constants.Assuming that there is a correlation between the RMSD and intermolecular forces, the RMSD can be predicted using the above equations.The intuition behind this assumption is that the closer-to-native ligand pose (smaller RMSD) should have stronger intermolecular interaction with NA than other poses.For each docked ligand pose, the interaction feature values between receptor atom i and ligand atom j similar to Equation 2 are represented by F(f (r ij )), such that: The NA residues can be classified into five generalized types, namely adenine (A), uracil (U) or thymine (T), cytosine (C), guanine (G), and ribose and phosphate backbone (N).The nucleobases of DNA and modified residue nucleobases are grouped into 4 main RNA nucleobases (A, U, G, C), depending on which nucleobases the residues are mutated from or have the closest structural similarity with (Appendix D), with thymine being in the same group as uracil.The NA SYBYL atom types are determined based on their position on these residues, and atoms in the backbone have residue type 'N' (Appendix E).The atoms in the modified region of the residues have residue type 'OTH'.These NA atoms and some ligand atoms with low occurrence are grouped by their chemical similarity, as described in Appendix C. The grouping of the residues and atom types reduces the sparsity of the dataset and the number of features in the dataset by binning the columns with a high number of 0-values, hence improving the generalizability of the model and making it applicable to datasets with modified residues.In total, 24 ligand atom types and 26 NA atom types (unique combinations of the residues and its corresponding atom types) were used, resulting in 624 pairs of interactions.Subsequently, columns with all 0 values were removed from the dataset, thus a final 529 pairs of interactions were included.
In the ablation studies performed in Section 3.2, the dataset that yielded the best result used a cutoff distance, R c , of 8 Å, along with the combination of the 1088 F( 1 r ) and F( 1 r 6 ) terms with the 30 rDock scoring function component terms for the features.It is noted that the model can be used to predict the RMSD of poses generated by other docking tools as the rDock scoring function components can also be obtained from the initial poses.

Machine learning models
A supervised learning algorithm is used to train the ML model to predict the RMSD of docked poses, which can also be utilized as the docking score.For this regression prediction task, the XGBoost algorithm was chosen because of its fast training speed and strong predictive capabilities in comparison with other ML models such as Support Vector Regression, Multi-layer Perceptron, and Random Forest algorithms as shown in Table A4.The superior performance of XGBoost over other algorithms may be due to the sparsity of the dataset.To ensure that datasets with the same PDB ID are not present in both the training and validation datasets simultaneously, systematic sampling is employed on the sorted PDB IDs to select the validating PDB ID at regular intervals.The data points are then split into training and validation datasets by their PDB ID.For each dataset, 100 models with randomized hyperparameter values are trained using five-fold cross-validation to assess the models' performance across different data subsets, which provides a reliable assessment of the model's generalization capabilities.R 2 values (Equation 3) are calculated for each fold.The model with the highest average validating R 2 across all folds, R 2 val , which indicates a strong correlation between predicted and actual RMSD, was selected as the best model.The model is trained with the learning objective of minimizing the regression squared loss and the hyperparameters of the best model are shown in Appendix F.
where: y = Actual RMSD of pose ŷ = Predicted RMSD of pose ȳ = Average actual RMSD of all poses

Validation using experimental result and MD simulations
Unlike protein-ligand complexes, there is no CASF-2016 [53] equivalent dataset for evaluating NA complex docking tools.Instead, the comparison of the scoring power of RmsdXNA and rDock is validated on experimental results and using MD simulation.
In the microscale thermophoresis (MST) experiment by Swain et al. [42] on GC  Using the MST results, an ideal docking score that can distinguish strong and weak binders should be able to accurately rank the compounds by their binding affinity and give the best ranking to Compound 15.Rigid receptor-f lexible ligand docking of the compounds was performed using rDock on the dinucleotide bulge position of GC PAN triplex (PDB ID: 6X5N), generating 100 poses for each compound.Then, RmsdXNA and rDock scores were used to rank the compounds.
MD simulations can offer insights into the dynamic behavior of receptor-ligand complexes, thus providing a more accurate assessment of ligand trajectories stability compared to docking alone [54,55].In this VS example, rDock is utilized to generate 20 poses within a 20 Å target search box on the 3' end of MALAT1 [56,57] (PDB ID: 4PLX) against two screening compound libraries, Hit2Lead (provided by Chembridge Corp. in San Diego, CA, USA, http://www.hit2lead.com)and Life Chemicals RNA screening library (https://www.lifechemicals.com).From each library, the top 20 ligands with the best rDock and RmsdXNA scores were chosen as the top-scored ligands for each method.Then, MD simulation (described in Appendix H) was conducted on the MALAT1-top ligand complexes to assess the reliability of the docking scoring methods through visual inspection the ligands' stability in the receptor during the 100 ns trajectory [58].

Model performance
The best model obtained from the hyperparameter tuning procedure had a R 2 val of 0.68301 ± 0.02330.The predicted RMSD values of each docked ligand pose in the validation dataset were combined across all folds to obtain the overall predicted RMSD for the entire dataset.Since the validation data remain unseen by the model in each fold, the obtained predictions accurately represent the model's generalizability.Figure 2(a) shows the comparison of the predicted RMSD with the actual RMSD of the docked poses, which has a strong correlation coefficient (R) of 0.83051 ± 0.00115.This finding underscores the RmsdXNA's ability to accurately predict RMSD values.
An effective score function should assign lower scores (better ranking) to the closest-to-native pose, which has a lower actual RMSD than the decoy poses.Therefore, the PCC (Equation 4) and SRCC (Equation 5) between the actual RMSD and scoring functions for each NA-ligand pair, which measure the correlations and ranking capabilities, respectively, are key metrics for evaluating their performance.Figure 2(b) and 2(c) illustrates each ligand's distribution of the PCC and SRCC values respectively for both scoring methods.RmsdXNA has a higher average PCC and SRCC than rDock.In addition, RmsdXNA outperforms rDock for 1271/1434 (88.6%) and 1254/1434 (87.4%) ligands in terms of PCC and SRCC respectively.This comparison revealed that RmsdXNA has a stronger correlation and ranking capabilities than rDock for most of the ligand structures. where: x = Actual RMSD of pose x = Average actual RMSD of all poses s = Score of pose s = Average score of all poses where: R(x) = ranking of pose based on RMSD (x) R(s) = ranking of pose based on score (s) n = number of poses The ability of RmsdXNA and rDock to select close-to-native poses, defined by poses with RMSD ≤ 2.0, was also analyzed.Figure 2(d) shows that the fraction of ligands that contained closeto-native poses in the top-n-scored poses was greater for Rms-dXNA.Additionally, out of all the ligands that are able to generate at least 1 close-to-native pose, 69.1%, 81.9% and 86.9% of the ligands for RmsdXNA and 45.9%, 62.0% and 70.8% of the ligands for rDock contained close-to-native poses in the top 1, 5, and 10 selected poses respectively.These results showed that RmsdXNA outperforms the rDock scoring function in pose identification.

Ablation studies
Ablation experiments were conducted to investigate the impact of various modifications to dataset generation on the performance of the model.The modified dataset was used to train the model using the steps described in Section 2.4.The model with the highest R 2 val for each modified dataset was subsequently selected and compared across the different modifications.
First, the effect of R c on dataset generation was examined and the results are summarized in Table A5.As R c varies from 4 to 10 Å, more interaction features are extracted and added to the dataset.New features with non-0 values may also be introduced into the dataset, increasing the number of columns of the dataset slightly.Previous studies by Szulc et al. [59] employed cutoff distances of 3.9 and 4.0 Å to detect hydrogen bonding and lipophilic interactions respectively in RNA complexes.However, R c = 8 Å gave the highest R 2 val of 0.68301 ± 0.02330.The improvement in the results from R c = 4 to 8 Å may be due to the increase in the number of non-0 values in the dataset and the addition of new information for the model.Beyond certain distances, such as R c = 9 and 10 Å, the added long-range and weak interaction may introduce noise to the data and additional information has insignificant contributions to the specific interaction between NA and ligand.Therefore, finding the optimal cutoff distance is vital for striking the right balance between capturing relevant interactions and avoiding unnecessary complexity in the dataset.
Next, the effect of adding the rDock scoring components, F( 1 r ), F( 1 r 6 ) and F( 1 r 12 ) features into the dataset was investigated.By generating datasets that combine these specific features, their contribution to the overall model performance can be evaluated and the best performing combinations can be determined.The results of the model with different combinations of features are shown in Table A6.The best result of R 2 val = 0.68301 ± 0.02330 was achieved when the rDock scoring function components were added to the F( 1 r ) and F( 1 r 6 ) features.The improved prediction accuracy suggests that the rDock components and the generated features complement each other effectively.Additionally, the analysis revealed that the F( 1 r 12 ) component is less important than the F( 1 r ) and F( 1 r 6 ) components, as the R 2 val of the model decreased to 0.67910 ± 0.01763 when the F( 1 r 12 ) component was added to the dataset.Wang et al. [41] explained that the short-range repulsive term, F( 1 r 12 ), is not relevant because of the low occurrence of smalldistance interactions in the NA-ligand complex.Although similar distance data are used to generate the feature values of and F( 1 r 12 ), the difference in the results suggests that proper data augmentation of the feature is needed to achieve high accuracy of the model.
Finally, the impact of different atom representation methods when generating the dataset on R 2 val is summarized in Table A7.The use of the SYBYL atom type to represent receptor and ligand atoms, element type to represent receptor atoms with the 'OTH' residue type, and other grouping methods mentioned in Appendix C, achieved the highest R 2 val of 0.68301 ± 0.02330.This outperformed models that used a full element or SYBYL atom representation with R 2 val values of 0.66726 ± 0.02244 and 0.67945 ± 0.01654 respectively.Compared to DeepRMSD, which uses element atom representation, using the SYBYL atom type can distinguish aromatic and non-aromatic atoms.This allows the model to capture pi-stacking interactions between the NA receptor and ligand which are abundant in NA complexes [60].Additionally, to handle sparse features involving NA modified residues atoms, binning them with other meaningful features can offer more instances for the model to learn from, while reducing the introduction of biased information.

Feature importance analysis
Feature importance analysis was conducted to determine the relative importance of individual features in predicting RMSD in the model.The following training process uses the hyperparameter obtained from the best model.For each fold, the model is used to predict the RMSD of the ligands from the modified validating dataset where a specific ij pair feature (F( 1 r ij ) and F( 1 )), or the rDock score component is artificially removed by setting all its values to 0. The change in the performance of the model was evaluated using R 2 , which represents the difference in R 2 between the original validation dataset and the dataset with the removed columns.This process is repeated for all columns, and the features that result in a decrease in R 2 are identified as important, whereas those resulting in an increase in R 2 are considered less important.This analysis is performed for each fold, allowing the observation of feature importance across different iterations.The analysis showed that certain features, such as aromatic interactions represented by the (U N.ar , N.ar), (A N.pl3 , N.ar), (U C.ar , N.ar) and (G C.ar , C.ar) features, have a significant impact on the model's predictive capabilities as shown in Figure 3(a).On the other hand, some features yielded positive R 2 values as shown in Figure 3(b), indicating that these features negatively affect the accuracy of the model.Despite the recognized importance of polar interactions in NA-ligand systems, the SCORE.INTER.POLAR term, which represents the intermolecular polar interaction score from rDock, showed the lowest importance.This may be due to its redundancy with the more effective capture of similar polar interactions by F( 1 r ) and F( 1 r 6 ) terms, potentially rendering the SCORE.INTER.POLAR term obsolete.The other low importance features are typically sparse in the dataset.For example, all the features involving highly sparse selenium (Se) atom interactions, with an average of 0.0494% non-0 instances in the dataset, did not reduce the accuracy of the model when they were removed.
The RmsdXNA model incorporates additional interaction terms that are not typically included in other ML models for NA-ligand systems, such as interactions involving metal and modified residue atoms.According to our feature analysis, the average R 2 of features containing receptor metals, ligand metals and modified residue atom (atoms with 'OTH' residues) interactions

Model performance across data points
The performance of RmsdXNA was assessed by examining its PCC values with respect to different ligand structures.This study aims to understand whether ligands with similar PCC values share common structural features that contribute to their performance variation.The structures of the ligands with the highest and lowest PCC values were selected for analysis.
Generally, RmsdXNA tends to give lower scores (better rankings) to poses that have more interaction contact with the receptor, such as poses that are deeper in the receptor surface pocket or intercalate between receptor surfaces.Therefore, scoring accuracy decreases for those in which the docking pocket deviates from the native ligand position, such as 2HO6 and 1M69 (Figure 4(a) and 4(b)).In the case of 2HO6, the docked poses from the 2 nearby ligands are generated within the same pocket.However, the results show significant disparity, with one achieving a PCC of 0.958, and the other achieving a PCC of -0.918.This poor performance may be caused by the model using the wrong native ligand pose as a reference when predicting the RMSD of the docked pose.Therefore, removing nearby ligands from the dataset is important for preventing such situations.
In addition, ligands with fused aromatic ring systems that are between the receptor residues tend to have low PCCs, such as 7N7D and 7MSV (Figure 4(c) and Figure 4(d) respectively).From the structure of 7N7D, the ligands that are on the outer surface of the receptor demonstrated better performance than those that are between the receptor surface.This observation suggests that it is difficult for the model to accurately predict the RMSD of docking poses for ligands with extended aromatic rings.One possible explanation is the excessive number of interactions surrounding the ligand, which could introduce noise into the model.
Conversely, ligands with hypoxanthine-like and ruthenium polypyridyl-like complex structures tend to give better results (Table A8).This could be due to the high frequency of such ligands in the dataset, which allows the model to gain better prediction capabilities and understanding of the interactions for this ligand class.Therefore, the performance of the model can be improved by providing a more diverse training dataset.
The ranking capability of RmsdXNA was analyzed for different groups of structures, such as structures solved using NMR vs. X-ray diffraction, and complexes involving DNA, RNA or RNA-DNA hybrid receptors.The average PCC of the ligands for each different group is shown in Table 1.For all the different groups of structures, the PCC and SRCC of RmsdXNA outperformed that of rDock.These findings reinforce the reliability and robustness of RmsdXNA across the different groups.

Evaluation on the test dataset
The model is evaluated using a testing dataset consisting of 125 out of 140 PDB structures (Appendix N) from various sources, Yan [46], Ruiz [18], Chen [61] and Philips [62], similar to the test dataset used in Feng et al. [19].These datasets primarily comprise complexes of DNA and RNA receptors with small molecules and peptide ligands.The remaining 15 PDB structures in the testing dataset were not used because of the selection criteria, as described in Appendix B. In addition, the remaining 1309 PDB structures from the full dataset were used for training the model using the best model's hyperparameter values.
The results of the evaluation of RmsdXNA's prediction on the testing datasets are shown in Figure 5.For all the testing datasets, a strong correlation between the predicted RMSD and the actual RMSD of the docked poses of 0.67228 to 0.85120 was obtained.Based on the PCC and SRCC values of each ligand and the fraction of ligands for which the top-n-pose had an RMSD ≤ 2.0 Å, Rms-dXNA outperformed rDock in terms of ranking capability and pose selection.The ability of each docking tool in selecting close-tonative poses in the top scoring poses was also compared.For PDB structures that were not found in our dataset, it is assumed that RmsdXNA is unable to predict the close-to-native pose for that ligand.For PDB structures containing multiple ligands, the pose  that gave the lowest RMSD was used for evaluation.RmsdXNA achieved the highest success rate compared to the other tools for all the testing datasets as shown in Table 2. Overall, RmsdXNA's scoring accuracy outperforms that of existing docking tools for various unseen datasets and it is more capable of identifying the correct docking pose for NA-ligand complexes.

Scoring power comparison validated by MST experiment
The comparison between rDock and RmsdXNA scores was validated using the in-vitro results of MST on GC PAN triplex by Swain et al. [42].The scores of the best poses selected by each scoring function are summarized in  6, differentiated by the direction that the extended five carbon ring is pointing: it is pointing away from the receptor for RmsdXNA, but pointing towards the receptor for rDock's best pose.The MD trajectory of Compound 15 is more stable when using the initial pose selected by RmsdXNA, indicating a closer resemblance between the actual binding mode of Compound 15 and the best pose selected by RmsdXNA.The difference in the stability of the ligand trajectories highlights the importance of selecting initial structures using the docking score to enhance the reliability of MD simulations.

Screening application validated with MD simulations
The screening power of RmsdXNA and rDock for VS application on MALAT1 was compared by evaluating the MD trajectories of the top ligands selected by each scoring function.RmsdXNA was able to identify about 6 times more stable ligands out of 20 top-scored ligands compared to rDock as shown in Table 3.This finding suggests that RmsdXNA has a higher success rate in identifying potential ligands that binds strongly to the receptor, making it more accurate and reliable for screening.Interestingly, no common top candidate ligands were identified by RmsdXNA and rDock.From the top poses selected by RmsdXNA and rDock as illustrated in Figure 7, it is observed that RmsdXNA tends to give better scores to poses that intercalate between the RNA residues, but no such poses are selected by rDock.This indicates that a Table 2: Number of PDB structures where the top 1 or top 5 poses selected by the different docking programs in f lexible-ligand docking, contain poses with RMSD ≤ 2 Å on the 4 test sets of diverse NA-ligand complexes.The results from NLDock, AutoDock, rDock (NLDock) and DOCK 6 are obtained from Feng et al. [19].For RmsdXNA and rDock, only ligands that are selected as described in Section 2.1 are used in this evaluation.AU: Please provide suitable wording for the table footnote to give the meaning of the bold values given in Table 2   change in the score function can result in different outputs.The ability to accurately identify stable ligands is crucial for selecting reliable docking tools and refining computational docking protocols.By improving the efficiency and success rate of VS-related efforts in drug discovery, RmsdXNA can greatly impact the field.

Limitations
In this study, RmsdXNA was evaluated and trained on ligand poses generated by local docking, excluding poses with RMSD ≥ 10 Åfrom the native pose.This exclusion may limit the functionality of RmsdXNA when the docking pocket is unknown, but it is necessary to reduce noise in the dataset during training and evaluation [63].RmsdXNA is expected to perform well if docking tools can accurately generate close-to-native ligand poses or address the f lexibility issue of the receptor as discussed in Stefaniak and Bujnicki [36].Furthermore, tools that identify RNA-ligand binding sites [64], can be used in conjunction with RmsdXNA to accurately identify the docking pocket.
The small number of experimentally determined NA-ligand complexes presents a challenge for improving the ML model  for NA docking tools.Infrequent feature interaction also hinder the model from learning this information effectively.However, the advantages of easy feature extraction and adaptability to different datasets presented in RmsdXNA simplify the addition of new data points to enhance the accuracy of the model.While RmsdXNA demonstrated overall better results than rDock in validation against GC PAN triplex MST experiment and MD simulations validation of MALAT1 VS, a wider variety of experimental dataset involving diverse receptors is required to comprehensively evaluate the reliability of the scoring functions for virtual screening and pose identification.

CONCLUSION
RmsdXNA, a ML regression model, was successfully developed and demonstrated to be an effective tool for predicting the RMSD of docked ligand poses in NA complexes.The model incorporates distance-based features that capture atom-atom interactions in NA complexes, including those containing metal ions, modified residues, and peptide ligands.The experiments showed that there is a correlation between the binding affinity and the predicted RMSD of a ligand pose, which can be used for pose identification and screening.RmsdXNA outperformed the other docking tools when evaluated on the testing dataset.It also achieved high prediction accuracy with an average validation R 2 of 0.68301 ± 0.02330 and an average PCC of 0.64531 ± 0.36262 and SRCC of 0.58465 ± 0.3519 across the different ligands.The alignment of the scoring ranking with the result of the MST experiments conducted on GC PAN triplex validated RmsdXNA's better capability than rDock in distinguishing strong binders from weaker binders.RmsdXNA also has a higher success rate in identifying good ligand candidates for VS, supported by the MD simulations of MALAT1.Overall, this study demonstrated that ML can be utilized to improve the accuracy of molecular docking methodologies in the field of computational drug design.

A Appendix Appendix A. PDB list
The list of PDB structures from NDB used for docking are shown in the list below.The PDB IDs in red are not used in the dataset as the ligands were docked unsuccessfully, where there are fewer than 100 poses out of 400 dock attempts with RMSD ≤ 10Å.

Figure 1 .
Figure 1.Overall architecture of the RmsdXNA framework.(A) Inter-molecular distance between a NA receptor atom and the ligand atom within a cutoff distance, R c is extracted for each generated pose.(B) The distance is used to form the F( 1 r ) and F( 1 r 6 ) terms, which will be concatenate with the rDock scoring components to form the features, X.The label, y is the actual RMSD of the docked poses.(C) XGBoostRegressor is used to train on the dataset predict the RMSD of the docked poses.

Figure 2 .
Figure 2. Performance of RmsdXNA and its comparison with rDock on the full validation dataset.

Figure 3 .
Figure 3. Change in R 2 val between the original validating dataset and the dataset with the removed column, R 2 val .Features with R 2 val < 0 in Figure 2(a) represents a poorer performance of the model after removing the feature, while features with R 2 val > 0 in Figure 3(b) represents an improvement in performance of the model after removing the feature.

Figure 4 .
Figure 4. Ligands with high PCC values of the actual vs. predicted RMSD shown in red.For structures that contain multiple ligands, the ligands with high PCC values are shown in green.

Figure 5 .
Figure 5. Result of the testing dataset.Figures in column a) shows the predicted vs. actual RMSD of the ligand pose; b) shows the PCC distribution of the ligands; c) shows the SRCC distribution of the ligands; d) shows the fraction of ligands in the top-scored n poses that contains the pose with the lowest RMSD.Row 1, 2, 3 and 4 represents results from the Yan, Ruiz, Chen and Philips dataset respectively.

Figure 7 .
Figure 7. Poses of the top 20 ligand selected by RmsdXNA and rDock from Chembridge and Life Chemicals screening libraries docked on 4PLX using rDock.Top-scored ligand poses selected by RmsdXNA tend to intercalate between the RNA residues.This is not observed for the top-scored ligands selected by rDock.

Table 1 :
The average PCC and SRCC of the actual RMSD vs. RmsdXNA and rDock's scoring functions of the poses of each ligands are summarized in this table.The PCC and SRCC are categorized by their structure solving method using NMR and X-ray diffraction, and by their receptor types of DNA, RNA and RNA-DNA hybrid, with bold values representing better performance among the scoring functions.

Table A10
the scores of Compound 15 and its modification counterparts, RmsdXNA ranks Compound 15 at a better rank of 2/4, whereas rDock rank Compound 15 at 4/4.These results validate that RmsdXNA's ability to distinguish strong from weak binders and its ligand screening capability is better than that of rDock.The best poses of Compound 15 selected by RmsdXNA and rDock are shown in Figure directly in the text.

Table 3 :
Number of stable ligands from the 20 top-scored ligands selected by RmsdXNA and rDock during the MD simulation of 100 ns.The Chembridge and Life Chemicals screening compound libraries were used for the VS.

Table A9 :
The full number of PDB structures in the NA-ligand test dataset and the number of ligands that were used in our dataset.